In [ ]:
import pandas as pd

In [ ]:
%matplotlib inline
import numpy as np
import matplotlib.pyplot as plt
try:
    import seaborn
except ImportError:
    pass

Data structures

Pandas provides two fundamental object types for working with labeled data, both built upon NumPy arrays: the Series object and the DataFrame object.

Series

A Series is a basic holder for one-dimensional labeled data. It can be created much as a NumPy array is created:


In [ ]:
s = pd.Series([0.1, 0.2, 0.3, 0.4])
s

Attributes of a Series: index and values

The Series has a built-in concept of an index, which by default is the numbers 0 through N - 1:


In [ ]:
s.index

You can access the underlying NumPy array representation with the .values attribute:


In [ ]:
s.values
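
As a minimal sketch of how the index and the values fit together, the cell below rebuilds the same Series and inspects both parts. The `to_numpy()` call is not used elsewhere in this notebook; it is the accessor that newer pandas versions offer alongside `.values`, and the variable names are illustrative only.

```python
import numpy as np
import pandas as pd

s = pd.Series([0.1, 0.2, 0.3, 0.4])

# The default index simply counts from 0 to N - 1.
labels = list(s.index)

# .values and .to_numpy() both expose the data as a NumPy array.
arr = s.to_numpy()
print(labels)
print(type(arr), arr.dtype)
```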

We can access series values via the index, just like for NumPy arrays:


In [ ]:
s[0]

Unlike the NumPy array, though, this index can be something other than integers:


In [ ]:
s2 = pd.Series(np.arange(4), index=['a', 'b', 'c', 'd'])
s2

In [ ]:
s2['c']

In this way, a Series object can be thought of as similar to an ordered dictionary mapping one typed value to another typed value.
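
The dictionary analogy can be sketched with a few dict-style operations on the Series defined above. This is an illustration under the assumption of a reasonably recent pandas version; `pairs`, `has_c` and `has_z` are names made up for the example.

```python
import numpy as np
import pandas as pd

s2 = pd.Series(np.arange(4), index=['a', 'b', 'c', 'd'])

# Membership tests look at the index labels, as `in` looks at dict keys.
has_c = 'c' in s2
has_z = 'z' in s2

# keys() returns the index; items() yields (label, value) pairs in order.
pairs = [(label, int(value)) for label, value in s2.items()]
print(has_c, has_z)
print(pairs)
```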

In fact, it's possible to construct a series directly from a Python dictionary:


In [ ]:
pop_dict = {'Germany': 81.3, 
            'Belgium': 11.3, 
            'France': 64.3, 
            'United Kingdom': 64.9, 
            'Netherlands': 16.9}
population = pd.Series(pop_dict)
population

We can index the populations like a dict as expected:


In [ ]:
population['France']

but with the power of NumPy arrays, such as vectorized arithmetic:


In [ ]:
population * 1000

Many things we have seen for NumPy can also be used with pandas objects.

Slicing:


In [ ]:
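# Label slices include both endpoints; sort the index first so the slice runs from Belgium through Germany
population.sort_index()['Belgium':'Germany']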

Fancy indexing, like indexing with a list or boolean indexing:


In [ ]:
population[['France', 'Netherlands']]

In [ ]:
population[population > 20]
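
To make the boolean indexing step explicit, the sketch below splits it in two: the comparison first builds a boolean Series, which is then used as the selector. The names `mask` and `large` are illustrative.

```python
import pandas as pd

population = pd.Series({'Germany': 81.3, 'Belgium': 11.3, 'France': 64.3,
                        'United Kingdom': 64.9, 'Netherlands': 16.9})

# The comparison yields a boolean Series with the same index.
mask = population > 20
print(mask)

# Indexing with the mask keeps only the entries where it is True.
large = population[mask]
print(large)
```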

Element-wise operations:


In [ ]:
population / 100

A range of methods:


In [ ]:
population.mean()

EXERCISE: Calculate the population numbers relative to Belgium.

In [ ]:
population / population['Belgium']

Alignment!

Pay attention to alignment, though: operations between Series align on the index, so values are matched by label before the operation is applied:


In [ ]:
s1 = population[['Belgium', 'France']]
s2 = population[['France', 'Germany']]

In [ ]:
s1

In [ ]:
s2

In [ ]:
s1 + s2
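
The consequence of alignment is easiest to see by looking at the labels that occur in only one operand. The sketch below repeats the addition and then shows one possible way to avoid the resulting NaN values, using `Series.add` with `fill_value`. Treating a missing country as 0 is an assumption made for the example, not a recommendation for real data.

```python
import pandas as pd

population = pd.Series({'Germany': 81.3, 'Belgium': 11.3, 'France': 64.3,
                        'United Kingdom': 64.9, 'Netherlands': 16.9})
s1 = population[['Belgium', 'France']]
s2 = population[['France', 'Germany']]

# Labels present in only one operand produce NaN in the result.
total = s1 + s2
print(total)

# Series.add with fill_value treats a missing label as 0 instead.
filled = s1.add(s2, fill_value=0)
print(filled)
```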

DataFrames: Multi-dimensional Data

A DataFrame is a tabular data structure (a multi-dimensional object to hold labeled data) comprised of rows and columns, akin to a spreadsheet, a database table, or R's data.frame object. You can think of it as multiple Series objects that share the same index.

One of the most common ways of creating a dataframe is from a dictionary of arrays or lists.

Note that in the IPython notebook, the dataframe will display in a rich HTML view:


In [ ]:
data = {'country': ['Belgium', 'France', 'Germany', 'Netherlands', 'United Kingdom'],
        'population': [11.3, 64.3, 81.3, 16.9, 64.9],
        'area': [30510, 671308, 357050, 41526, 244820],
        'capital': ['Brussels', 'Paris', 'Berlin', 'Amsterdam', 'London']}
countries = pd.DataFrame(data)
countries
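
The view of a DataFrame as several Series sharing one index can be sketched by building one from a dictionary of Series. The three-country subset and the names `small` and `col` are chosen for illustration only.

```python
import pandas as pd

population = pd.Series({'Belgium': 11.3, 'France': 64.3, 'Germany': 81.3})
area = pd.Series({'Belgium': 30510, 'France': 671308, 'Germany': 357050})

# Each Series becomes a column; rows are matched on the shared index.
small = pd.DataFrame({'population': population, 'area': area})
print(small)

# Selecting a column gives back a Series with that same index.
col = small['area']
print(col)
```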

Attributes of the DataFrame

Besides an index attribute, a DataFrame also has a columns attribute:


In [ ]:
countries.index

In [ ]:
countries.columns

To check the data types of the different columns:


In [ ]:
countries.dtypes

An overview of that information can be given with the info() method:


In [ ]:
countries.info()

A DataFrame also has a values attribute, but beware: when the data is heterogeneous, all values are upcast to a common type:


In [ ]:
countries.values
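
What the upcasting amounts to can be checked on the dtype of the resulting array. The sketch below uses a two-row version of the table (an assumption to keep it short) and compares the mixed case with a numeric-only selection.

```python
import numpy as np
import pandas as pd

countries = pd.DataFrame({
    'country': ['Belgium', 'France'],
    'population': [11.3, 64.3],
    'area': [30510, 671308],
})

# Mixed string and numeric columns fall back to the generic object dtype.
mixed = countries.values
print(mixed.dtype)

# Numeric columns alone share a numeric dtype (the integers become floats).
numeric = countries[['population', 'area']].values
print(numeric.dtype)
```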

If we don't like what the index looks like, we can replace it by setting one of our columns as the index:


In [ ]:
countries = countries.set_index('country')
countries
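
A short sketch of what changing the index makes possible: label-based row lookup with `.loc`, and `reset_index` to undo the change. Neither call appears elsewhere in this notebook, and the reduced two-column table is an assumption made for brevity.

```python
import pandas as pd

countries = pd.DataFrame({
    'country': ['Belgium', 'France', 'Germany'],
    'capital': ['Brussels', 'Paris', 'Berlin'],
})

# set_index returns a new DataFrame; the original is left unchanged.
indexed = countries.set_index('country')

# With country names as the index, .loc selects by row and column label.
capital = indexed.loc['France', 'capital']
print(capital)

# reset_index moves the index back into an ordinary column.
restored = indexed.reset_index()
print(list(restored.columns))
```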

To access a Series representing a column in the data, use typical indexing syntax:


In [ ]:
countries['area']

As you play around with DataFrames, you'll notice that many operations which work on NumPy arrays will also work on dataframes.

For example there's arithmetic. Let's compute density of each country:


In [ ]:
countries['population']*1000000 / countries['area']

Adding a new column to the dataframe is very simple:


In [ ]:
countries['density'] = countries['population']*1000000 / countries['area']
countries

We can use masking the way we did in NumPy to select certain data:


In [ ]:
countries[countries['density'] > 300]
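
Masks can also be combined. The sketch below rebuilds the numeric columns of the table and selects countries that are both dense and populous. The thresholds are arbitrary values picked for the example.

```python
import pandas as pd

countries = pd.DataFrame(
    {'population': [11.3, 64.3, 81.3, 16.9, 64.9],
     'area': [30510, 671308, 357050, 41526, 244820]},
    index=['Belgium', 'France', 'Germany', 'Netherlands', 'United Kingdom'])
countries['density'] = countries['population'] * 1000000 / countries['area']

# Combine conditions with & (and) or | (or); each needs its own parentheses.
dense = countries['density'] > 300
populous = countries['population'] > 15
dense_and_populous = countries[dense & populous]
print(dense_and_populous)
```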

We can also sort the rows by the values in a column and slice the result to take the first two rows:


In [ ]:
countries.sort_values('density', ascending=False).iloc[:2]

One useful method is describe, which computes summary statistics for each numeric column:


In [ ]:
countries.describe()
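
The result of describe is itself a DataFrame, so individual statistics can be looked up by label. A minimal sketch on a two-column version of the table (an assumption for brevity); `summary` and `mean_population` are illustrative names.

```python
import pandas as pd

countries = pd.DataFrame(
    {'population': [11.3, 64.3, 81.3, 16.9, 64.9],
     'capital': ['Brussels', 'Paris', 'Berlin', 'Amsterdam', 'London']})

# By default describe() summarizes only the numeric columns.
summary = countries.describe()
print(summary)

# Rows are labeled by statistic, columns by the original column name.
mean_population = summary.loc['mean', 'population']
print(mean_population)
```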

The plot method can be used to quickly visualize the data in different ways:


In [ ]:
countries.plot()

For this dataset, however, the default plot is not very informative. A bar chart of a single column is clearer:


In [ ]:
countries['population'].plot(kind='bar')

You can play with the kind keyword: 'line', 'bar', 'hist', 'density', 'area', 'pie', 'scatter', 'hexbin' (the last two are available for DataFrames only).

Importing and exporting data

A wide range of input/output formats are natively supported by pandas:

  • CSV, text
  • SQL database
  • Excel
  • HDF5
  • json
  • html
  • pickle
  • ...

In [ ]:
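# In the notebook, type pd.read_ and press <TAB> to list the available readers
[name for name in dir(pd) if name.startswith('read_')]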

In [ ]:
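# In the notebook, type countries.to_ and press <TAB> to list the export methods
[name for name in dir(countries) if name.startswith('to_')]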
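
As one concrete instance of the reader and writer functions, the sketch below round-trips a small table through CSV. It writes to an in-memory string rather than a file so that nothing is left on disk. The two-row table and the names `csv_text` and `restored` are assumptions for the example.

```python
import io
import pandas as pd

countries = pd.DataFrame(
    {'population': [11.3, 64.3], 'capital': ['Brussels', 'Paris']},
    index=pd.Index(['Belgium', 'France'], name='country'))

# to_csv without a path returns the CSV text as a string.
csv_text = countries.to_csv()
print(csv_text)

# read_csv accepts any file-like object; index_col restores the index.
restored = pd.read_csv(io.StringIO(csv_text), index_col='country')
print(restored)
```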

Other features

  • Working with missing data (.dropna(), pd.isnull())
  • Merging and joining (concat, join)
  • Grouping: groupby functionality
  • Reshaping (stack, pivot)
  • Time series manipulation (resampling, timezones, ..)
  • Easy plotting
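
A brief, hedged taste of three of the features listed above (missing data, grouping, and concatenation), using a small hypothetical table that is not part of this notebook's data:

```python
import numpy as np
import pandas as pd

# Hypothetical data with one missing value.
df = pd.DataFrame({'region': ['west', 'west', 'east'],
                   'value': [1.0, np.nan, 3.0]})

# Missing data: isnull() flags NaN entries, dropna() removes those rows.
n_missing = int(df['value'].isnull().sum())
complete = df.dropna()

# Grouping: sum the values per region (NaN is skipped by default).
per_region = df.groupby('region')['value'].sum()
print(per_region)

# Merging: concat stacks DataFrames on top of each other.
doubled = pd.concat([df, df], ignore_index=True)
print(len(doubled))
```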

There are many, many more interesting operations that can be done on Series and DataFrame objects, but rather than continue using this toy data, we'll instead move to a real-world example, and illustrate some of the advanced concepts along the way.

See the next notebooks!

Acknowledgement

© 2015, Stijn Van Hoey and Joris Van den Bossche (stijnvanhoey@gmail.com, jorisvandenbossche@gmail.com). Licensed under CC BY 4.0 Creative Commons.

This notebook is partly based on material of Jake Vanderplas (https://github.com/jakevdp/OsloWorkshop2014).